Top-k Correlation Computation

Authors

  • Hui Xiong
  • Wenjun Zhou
  • Mark Brodie
  • Sheng Ma
Abstract

Recently, there has been considerable interest in efficiently computing strongly correlated pairs in large databases. Most previous studies require the user to specify a minimum correlation threshold before the computation can run. However, an appropriate threshold can be difficult to provide in practice, since different data sets typically have different characteristics. To this end, we propose an alternative task: finding the top-k strongly correlated pairs. We identify a 2-D monotone property of an upper bound of the φ correlation coefficient and develop an efficient algorithm, called TOP-COP, that exploits this property to prune many pairs without even computing their correlation coefficients. Our experimental results show that TOP-COP can be an order of magnitude faster than alternative approaches for mining the top-k strongly correlated pairs. Finally, we show that the performance of the TOP-COP algorithm is closely tied to the degree of data dispersion: the higher the degree of data dispersion, the larger the computational savings achieved by TOP-COP.
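The pruning idea in the abstract can be illustrated with a small sketch. This is not the TOP-COP algorithm itself, whose details are not given here; it is a plain pairwise scan that keeps the k best pairs in a heap and skips any pair whose upper bound cannot beat the current k-th best. The bound used follows from the fact that the co-occurrence count of two items is at most the smaller of their individual counts. The function names (`phi`, `phi_upper`, `top_k_pairs`) and the transaction format (a list of sets of item labels) are assumptions made for illustration.

```python
import heapq
from itertools import combinations
from math import sqrt


def phi(n, na, nb, nab):
    """phi coefficient of two binary items from raw counts."""
    return (n * nab - na * nb) / sqrt(na * nb * (n - na) * (n - nb))


def phi_upper(n, na, nb):
    """Upper bound of phi from the two item counts alone (co-occurrence <= min count)."""
    hi, lo = max(na, nb), min(na, nb)
    return sqrt(lo * (n - hi) / (hi * (n - lo)))


def top_k_pairs(transactions, k):
    """Return (results, pruned): the k highest-phi pairs and how many pairs were skipped."""
    if k <= 0:
        return [], 0
    n = len(transactions)
    tids = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tids.setdefault(item, set()).add(tid)
    # phi is undefined for items that occur in every transaction; skip them.
    items = sorted(i for i, t in tids.items() if 0 < len(t) < n)

    heap = []  # min-heap of (phi, pair); heap[0] is the current k-th best
    pruned = 0
    for a, b in combinations(items, 2):
        na, nb = len(tids[a]), len(tids[b])
        if len(heap) == k and phi_upper(n, na, nb) <= heap[0][0]:
            pruned += 1  # cannot beat the k-th best, so phi is never computed
            continue
        score = phi(n, na, nb, len(tids[a] & tids[b]))
        if len(heap) < k:
            heapq.heappush(heap, (score, (a, b)))
        else:
            heapq.heappushpop(heap, (score, (a, b)))
    return sorted(heap, reverse=True), pruned


# Toy data, invented for this example.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"d"}, {"a", "c"}]
results, pruned = top_k_pairs(transactions, k=1)
print(results, pruned)
```

The bound depends only on the two item supports, so it costs nothing beyond a single pass over the data, whereas the exact coefficient needs the co-occurrence count of each pair. The sketch still visits every pair once; the abstract's 2-D monotone property is what would let an algorithm discard whole groups of pairs at a time.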


Similar resources

Top-K Correlation Sub-graph Search in Graph Databases

Recently, due to its wide applications, (similar) subgraph search has attracted a lot of attention from the database and data mining communities, such as [13, 18, 19, 5]. In [8], Ke et al. first proposed the correlation sub-graph search problem (CGSearch for short) to capture the underlying dependency between subgraphs in a graph database, along with the CGS algorithm. However, the CGS algorithm requires the speci...

Efficiently Processing of Top-k Typicality Query for Structured Data

This work presents a novel ranking scheme for structured data. We show how to apply the notion of typicality analysis from cognitive science and how to use this notion to formulate the problem of ranking data with categorical attributes. First, we formalize the typicality query model for relational databases. We adopt the Pearson correlation coefficient to quantify the extent of the typicality of a...

Aggregation-Aware Top-k Computation for Full-Text Search

A typical scenario in information retrieval and web search is to index a given type of items (e.g., web pages, images) and provide search functionality for them. In such a scenario, the basic units of indexing and retrieval are the same. Extensive study has been done for efficient top-k computation in such settings. This paper studies top-k processing for many emerging scenarios: efficiently re...

RWTH Aachen University, I 5 Max-Planck-Institut für Informatik, AG 5 Holistic Top-k

Querying large data sets is a challenging task in today’s information systems. Users are typically interested in the k most relevant results, namely the first page (e.g., the Google search engine) of the given result set. That is, given a dataset D and a user-defined similarity function f, we are interested in calculating the top-k, i.e., the k highest-ranked results (answers). Finding the top-...

Discovery of Top-k Dense Subgraphs in Dynamic Graph Collections

Dense subgraph discovery is a key issue in graph mining, due to its importance in several applications, such as correlation analysis, community discovery in the Web, gene co-expression and protein-protein interactions in bioinformatics. In this work, we study the discovery of the top-k dense subgraphs in a set of graphs. After the investigation of the problem in its static case, we extend the m...

Journal title:
  • INFORMS Journal on Computing

Volume 20

Publication date: 2008